Abstract

Tracking entities so that new or important information about
them is caught is a real challenge with many applications
(e.g., information monitoring, marketing). We are interested
in how to represent an entity profile to fulfill two purposes:

1.  entity  detection  and  disambiguation, 

2.  novelty  and  importance  quantification. 

We propose an entity profile that uses two language models.
The first, the Reference Language Model (RLM), is mainly used
for disambiguation. The second is a formalization of a
Time-Aware Language Model (TALM), which is used for novelty
detection. To rank documents, we propose a semi-supervised
classification approach that uses meta-features computed on
documents using entity profiles and time series.

1  Introduction 

This article introduces our system for the Knowledge Base
Acceleration (KBA) track of the Text REtrieval Conference
(TREC). This challenging task started in 2012 to answer a need
in Information Retrieval. Many documents appear every day on the
Web, and finding relevant documents about a topic may be
difficult depending on how relevancy is defined. The KBA track
focuses on filtering documents that are centered on a topic
while ranking them according to whether they carry important or
additional information about the topic.

Frank et al. (2012) showed that the time lag between the
publication date of a cited news article and the date the news
is actually added to the corresponding Wikipedia article can be
very large (median 356 days), especially for non-popular
entities. A possible application is to use highly ranked
documents as suggestions for Wikipedia contributors.

The KBA track is divided into two tasks: CCR (Cumulative
Citation Recommendation) and SSF (Streaming Slot Filling). The
CCR task is to select documents worth citing in the profile of
an entity (e.g., a Wikipedia or Freebase article). The SSF task
is to detect changes in the given slots for each of the target
entities. This article focuses only on the CCR task.

In the CCR task, the system must select, from a stream, the
documents related to the target entities. The system must also
rate the usefulness of each document using one of these four
relevance classes:

-  garbage:  no  information  about  target  entity; 

-  neutral:  informative  but  not  citable; 

-  useful:  bio,  primary  or  secondary  source  useful 
when  creating  a  profile  from  scratch; 

-  vital:  timely  info  about  the  entity’s  current  state, 
actions,  or  situation. 

The stream corpus contains timestamped documents crawled from
newswires, blogs, forums, reviews, etc. It must be processed in
chronological order to simulate real-life filtering. In
addition, the relevancy of a document must be assessed as soon
as the document appears in the stream; a decision cannot be
postponed. Each year, the organizers select a set of entities
and a set of documents is annotated according to the selected
entities. In 2014, about 30,000 documents were annotated (8,000
can be used for training purposes).

Our approach uses entity profiles built in a semi-supervised way
and time series analysis to compute a set of meta-features for
each document. The meta-features are used in a classification
system to determine the class of each document among garbage,
neutral, useful and vital. In the remainder of this article, we
detail the concept of entity profile, then we describe the
different meta-features used in the classification system. We
then detail the different strategies we adopt. Finally, we
discuss our experiments on the KBA framework and the results
from the official and unofficial KBA submissions. Unofficial KBA
submissions come from experiments run after the official
submission deadline.
2  Entity  Profiles

The term entity describes a single and unique representation of
a person, an organization, a music band, and so on. Documents
refer to entities using their surface form names. An entity may
have several surface form names (e.g., Tim Cook, the Apple CEO).
In addition, one surface form may be used for several entities
(e.g., Boris Berezovsky the businessman or the pianist). Such
ambiguous entities are called homonymous. We propose a filtering
system that applies two steps to each document: 1. keep the
document only if an occurrence of a surface form is found in it;
2. assign a class to the document for each entity detected in
step 1. We propose an approach that uses entity profiles as well
as a classification system to perform those two steps.

2.1  Detecting  entities  within  documents 

The first step of the filtering system aims to find documents
that contain an occurrence of the entity. Cucerzan (2007)
proposes an approach that uses Wikipedia to build an entity
profile. Given the Wikipedia page dedicated to the entity, the
method uses heuristics and knowledge base graph exploration to
extract: a language model, a list of relations (all entities
having a connection to the Wikipedia page dedicated to the
entity) and a list of surface form names.

The most intuitive way to detect an entity within a document is
to find occurrences of any of its surface forms. We propose
heuristics to automatically build patterns out of surface form
names. These heuristics aim to detect acronyms and middle names.
We use the notation [ ] to surround optional words and * to
indicate that a word is incomplete.

B.N.S.F.  railway  =>  B*N*S*F*  railway 

Chad  R.  Kroeger  =>  Chad  [R*]  Kroeger 

We use this notation in addition to the approach of Cucerzan
(2007) to search for surface forms within the Wikipedia page
centered on the entity. For instance, the system can now detect
that B.N.S.F. railway stands for Burlington Northern Santa Fe
railway.
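The pattern notation above can be illustrated with a minimal Python sketch. It is not the implementation used in our system: the function names `to_pattern` and `pattern_to_regex`, the rule that a single dotted letter between two other words is an optional middle initial, and the case-insensitive matching are all assumptions made for illustration.

```python
import re


def to_pattern(surface_form):
    """Rewrite a surface form in the [ ] / * notation (hypothetical helper)."""
    tokens = surface_form.split()
    out = []
    for i, tok in enumerate(tokens):
        if re.fullmatch(r"(?:[A-Za-z]\.){2,}", tok):
            # Dotted acronym: every letter becomes an incomplete word.
            out.append("".join(ch + "*" for ch in tok if ch != "."))
        elif re.fullmatch(r"[A-Za-z]\.", tok) and 0 < i < len(tokens) - 1:
            # Single initial between two words: optional, incomplete word.
            out.append("[" + tok[0] + "*]")
        else:
            out.append(tok)
    return " ".join(out)


def pattern_to_regex(pattern):
    """Compile the notation into a regular expression (assumed semantics)."""
    regex = ""
    for idx, tok in enumerate(pattern.split()):
        optional = tok.startswith("[") and tok.endswith("]")
        core = tok[1:-1] if optional else tok
        if "*" in core:
            # "X*" matches any word starting with X, optionally dotted.
            pieces = [re.escape(p) + r"\w*\.?" for p in core.split("*") if p]
            body = r"\s*".join(pieces)
        else:
            body = re.escape(core)
        sep = r"\s+" if idx > 0 else ""
        regex += f"(?:{sep}{body})?" if optional else sep + body
    return re.compile(r"\b" + regex + r"\b", re.IGNORECASE)


bnsf = pattern_to_regex(to_pattern("B.N.S.F. railway"))
print(bool(bnsf.search("the Burlington Northern Santa Fe railway said")))
```

Such loose patterns favour recall over precision; the false positives they let through are expected to be handled by the disambiguation features computed in the second filtering step.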

2.2  Language  Models  in  Entity  Profile 

One main aim of the entity profile is to help in entity
disambiguation. Navigli (2009) shows that knowing the context in
which a word occurs helps in word sense disambiguation. The same
observation can be transferred to entities. Sehgal and
Srinivasan (2007) define an entity profile as a language model
built using the top-n documents found on Google. They then
compare the obtained profile with the Wikipedia page
corresponding to the entity and obtain good results. However,
such a method does not address the homonymy issue.

Efron (2014) proposed a method to update the language model of
an entity profile using documents ranked as relevant by their
system. However, results were negatively affected. Indeed,
updating the language model that is used to describe the entity
may lead to a topic drift. To avoid the topic drift, we define
an entity profile with two language models serving two different
but essential purposes:

- the Reference Language Model (RLM) gathers information that
helps identify the entity within documents. The language model
is a unigram representation where each word is associated with a
probability. To completely avoid the topic drift, an RLM must
only be updated with manual inputs. In addition, the approaches
of Cucerzan (2007) and Sehgal and Srinivasan (2007) can be used
to easily build such a model.

- the Time-Aware Language Model (TALM) captures a representation
of the current events that occur for an entity. The language
model is a unigram representation where each word is associated
with a probability and a timestamp. Contrary to the RLM, the
TALM is constantly updated using documents ranked as relevant to
an entity. We use the time component and a sigmoid function to
forget information after a certain period of time. We think that
two identical events can appear at different periods and we want
to be able to catch both of them. Indeed, the fact that an event
happens again after a period may indicate that something
important is happening for the entity about that particular
event (e.g., someone getting married several times).

3  Language  Models  formalization 

3.1  The  Reference  Language  Model 

The RLM represents the knowledge about the entity. This
knowledge helps for entity disambiguation. We propose to
directly compare the probabilities from the RLM to the document
using a measure such as the cosine similarity. The resulting
score indicates whether the context of the document is similar
to the one described in the RLM or not.

Let us define $\mathcal{R}$ as a set of documents such that
$[d_1, \ldots, d_n] \in \mathcal{R}$. Let $tf(w_i, d_n)$ be the
function that gives the number of occurrences of a word $w_i$ in
a document $d_n$. We define $df(w_i, \mathcal{R})$, the number
of times a word $w_i$ occurs in the language model
$\mathcal{R}$, as:

\[
df(w_i, \mathcal{R}) = \sum_{d_n \in \mathcal{R}} tf(w_i, d_n) \tag{1}
\]

Let us define $len(d_n)$, the total number of occurrences of all
words $[w_1, \ldots, w_i] \in d_n$, and $len(\mathcal{R})$, the
total number of occurrences of all words
$[w_1, \ldots, w_i] \in \mathcal{R}$, as:

\[
len(d_n) = \sum_{w_i \in d_n} tf(w_i, d_n), \qquad
len(\mathcal{R}) = \sum_{d_n \in \mathcal{R}} len(d_n) \tag{2}
\]

The normalized version of term frequency is referred to as the
term probability. We then define:

\[
p(w_i \mid d_n) = \frac{tf(w_i, d_n)}{len(d_n)}, \qquad
p(w_i \mid \mathcal{R}) = \frac{df(w_i, \mathcal{R})}{len(\mathcal{R})} \tag{3}
\]
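As an illustration of the RLM term probabilities, the following Python sketch counts words over a few reference documents and normalizes the counts. The whitespace tokenizer and the name `build_rlm` are assumptions made for this example, not part of the system described here.

```python
from collections import Counter


def tokenize(text):
    # Assumption: lowercased whitespace tokenization is enough for a sketch.
    return text.lower().split()


def build_rlm(documents):
    """Return p(w | R) for every word of the reference documents."""
    df = Counter()
    for doc in documents:
        df.update(tokenize(doc))      # df(w, R): sum of tf(w, d_n) over documents
    total = sum(df.values())          # len(R): sum of len(d_n) over documents
    return {word: count / total for word, count in df.items()}


rlm = build_rlm(["tim cook apple", "apple ceo"])
print(rlm["apple"])  # 2 occurrences out of 5 tokens
```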


3.2  The  Time-Aware  Language  Model 

The Time-Aware Language Model (TALM) searches for novelty about
an entity. The TALM aggregates information from documents that
are relevant to the entity. However, gathering too much
information may lead, at a certain point, to missing novelty. In
addition, a language model holding too much information may
drift. We design the TALM so that it uses a time-aware function
allowing it to smoothly forget old documents. A time-aware
function gives a weight according to two events $e_1$ and $e_2$
having respective timestamps $t_{e_1}$ and $t_{e_2}$ with
$t_{e_1} > t_{e_2}$. We propose $\mathcal{D}$, a time-aware
function that gives less credit to a word if it was seen a long
time ago. The amount of time required to forget a piece of
information is defined using a constant parameter $\lambda$ as
follows:

\[
\Delta t = \frac{1}{\lambda} \, (t_{e_1} - t_{e_2})
\]
\[
\mathcal{D}(t_{e_1}, t_{e_2}) =
\begin{cases}
1, & \text{if } \Delta t < 0 \\
0, & \text{if } \Delta t > 1 \\
\dfrac{1}{1 + e^{(\ldots)}}, & \text{otherwise}
\end{cases}
\tag{4}
\]
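A forgetting function of this shape can be sketched in Python as follows. The exact exponent of the sigmoid is not reproduced here, so the `steepness` and `midpoint` defaults below are assumptions chosen only to give a decreasing logistic curve on the interval [0, 1]; `lam` stands for the constant forgetting parameter.

```python
import math


def decay(t_new, t_old, lam, steepness=10.0, midpoint=0.5):
    """Weight given at time t_new to something seen at time t_old.

    steepness and midpoint are illustrative assumptions, not values
    taken from the formalization above.
    """
    dt = (t_new - t_old) / lam
    if dt < 0:
        return 1.0          # event not in the past: full credit
    if dt > 1:
        return 0.0          # older than the forgetting horizon: forgotten
    return 1.0 / (1.0 + math.exp(steepness * (dt - midpoint)))


# With a horizon of 30 days, a 3-day-old word keeps most of its weight
# while a 27-day-old word is almost forgotten.
print(decay(3, 0, lam=30), decay(27, 0, lam=30))
```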

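Let us define $\mathcal{T}^{\Delta}$, a time-aware language
model made up of a set of timestamped documents such that
$[d_1 \to t_{d_1}, \ldots, d_n \to t_{d_n}] \in \mathcal{T}^{\Delta}$.
We use the superscript $\Delta$ to indicate that a function is
used in a time-aware context. Let us consider $d_c$, a new
document having a timestamp $t_c$. Let $tf^{\Delta}(w_i, d_n, t_c)$
be a function that computes the number of occurrences of a word
$w_i$ in a document $d_n$ while considering a time $t_c$. We
also define $df^{\Delta}(w_i, \mathcal{T}^{\Delta}, t_c)$, the
number of times a word $w_i$ occurs in $\mathcal{T}^{\Delta}$,
as follows:

\[
tf^{\Delta}(w_i, d_n, t_c) = \mathcal{D}(t_c, t_{d_n}) \cdot count(w_i \mid d_n)
\]
\[
df^{\Delta}(w_i, \mathcal{T}^{\Delta}, t_c) = \sum_{d_n \in \mathcal{T}^{\Delta}} tf^{\Delta}(w_i, d_n, t_c) \tag{5}
\]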

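Considering a time $t_c$, let us define $len^{\Delta}(d_n, t_c)$,
the time-weighted number of occurrences of all words
$[w_1, \ldots, w_i] \in d_n$, and
$len^{\Delta}(\mathcal{T}^{\Delta}, t_c)$, the time-weighted
number of occurrences of all words
$[w_1, \ldots, w_i] \in \mathcal{T}^{\Delta}$, as follows:

\[
len^{\Delta}(d_n, t_c) = \sum_{w_i \in d_n} tf^{\Delta}(w_i, d_n, t_c)
\]
\[
len^{\Delta}(\mathcal{T}^{\Delta}, t_c) = \sum_{d_n \in \mathcal{T}^{\Delta}} len^{\Delta}(d_n, t_c) \tag{6}
\]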
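The time-aware counts can be sketched as a small Python class that stores timestamped documents and weights every count by the forgetting function at query time. The class name `TimeAwareLM`, the whitespace tokenizer and the sigmoid constants (10.0 and 0.5) are assumptions made for illustration.

```python
import math
from collections import Counter


def decay(t_new, t_old, lam):
    # Forgetting function; the sigmoid constants are assumed values.
    dt = (t_new - t_old) / lam
    if dt < 0:
        return 1.0
    if dt > 1:
        return 0.0
    return 1.0 / (1.0 + math.exp(10.0 * (dt - 0.5)))


class TimeAwareLM:
    """Hypothetical container for timestamped documents."""

    def __init__(self, lam):
        self.lam = lam
        self.docs = []  # list of (timestamp, word counts)

    def update(self, text, timestamp):
        self.docs.append((timestamp, Counter(text.lower().split())))

    def df(self, word, t_c):
        # Time-weighted count of `word` over all stored documents.
        return sum(decay(t_c, t_d, self.lam) * counts[word]
                   for t_d, counts in self.docs)

    def length(self, t_c):
        # Time-weighted count of all words over all stored documents.
        return sum(decay(t_c, t_d, self.lam) * sum(counts.values())
                   for t_d, counts in self.docs)


lm = TimeAwareLM(lam=10)
lm.update("merger merger talks", timestamp=0)
lm.update("merger", timestamp=100)
# At time 100 the first document is older than the horizon and is ignored.
print(lm.df("merger", 100), lm.df("talks", 100))
```

Storing raw counts and applying the weight at query time keeps updates cheap; a production system would probably prune documents whose weight has reached zero.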

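Let us define $N^{\Delta}(\mathcal{T}^{\Delta}, t_c)$, the number
of documents considered at time $t_c$, and
$idf^{\Delta}(w_i, t_c, \mathcal{T}^{\Delta})$, the inverse
document frequency, as follows:

\[
N^{\Delta}(\mathcal{T}^{\Delta}, t_c) = \sum_{n=1} \mathcal{D}(t_c, t_{d_n})
\]
\[
idf^{\Delta}(w_i, t_c, \mathcal{T}^{\Delta}) = \log \frac{\ldots}{df^{\Delta}(w_i, \mathcal{T}^{\Delta}, t_c) + 0.5} \tag{7}
\]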

To define the term probability functions, we need to consider
the time $t_{w_i}$ corresponding to the last time the word $w_i$
occurred in $\mathcal{T}^{\Delta}$. The term probability
functions $p^{\Delta}(w_i, t_{w_i}, t_c \mid d_n)$ and
$p^{\Delta}(w_i, t_{w_i}, t_c \mid \mathcal{T}^{\Delta})$ are
defined in equation (8).

4  Document  classification  using  meta-features

In previous years of KBA, many systems have used meta-features
within a classification system (Bonnefoy et al., 2013a; Bonnefoy
et al., 2013b; Balog et al., 2013; Bouvier and Bellot, 2014).
Those studies show that some meta-features work better than
others. We summarize in the following subsections the
meta-features we have been using as well as the new features
designed with our new entity profile representation.

4.1  Entity  Disambiguation  meta-features 

The entity-related meta-features aim to quantify, using
different measures, how relevant a document is to the entity. In
the first filtering step, a document is selected using only the
surface form names. However, an entity can be ambiguous and thus
a document may contain occurrences of surface form names of a
homonymous entity.

To ensure a document refers to the target entity, we use the
context given by the entity profile to compute the following
features:

- The Cosine Similarity is computed using the term frequencies
$tf(w_i \mid d)$ and $tf(w_i \mid \mathcal{R})$ of the words
$w_i \in d \cup \mathcal{R}$, given the vector representations
of the document $d$ and of the Reference Language Model
(equation 9);


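\[
p^{\Delta}(w_i, t_{w_i}, t_c \mid d_n) = \mathcal{D}(t_c, t_{w_i}) \cdot \frac{\ldots}{size(d_n)}
\]
\[
p^{\Delta}(w_i, t_{w_i}, t_c \mid \mathcal{T}^{\Delta}) = \frac{\sum_{n} p^{\Delta}(w_i, t_{w_i}, t_c \mid d_n)}{\ldots} \tag{8}
\]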


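\[
\cos(d, \mathcal{R}) = \frac{\sum_{i=1}^{n} tf(w_i \mid d) \cdot tf(w_i \mid \mathcal{R})}
{\sqrt{\sum_{i=1}^{n} tf(w_i \mid d)^2} \cdot \sqrt{\sum_{i=1}^{n} tf(w_i \mid \mathcal{R})^2}} \tag{9}
\]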

- The Surface Forms Term Frequency measures the term frequency
of each surface form within the document and the title;

- The Entity Relations Term Frequency measures the term
frequency of each relation by type of relation (incoming,
outgoing, mutual) extracted from the Wikipedia knowledge graph.
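The cosine similarity between a document and the Reference Language Model can be sketched in Python over sparse term-frequency dictionaries. The function name `cosine` and the choice to return 0.0 for an empty vector (for example, an entity without reference file) are assumptions of this example.

```python
import math


def cosine(tf_doc, tf_ref):
    """Cosine similarity between two {word: term frequency} mappings."""
    vocab = set(tf_doc) | set(tf_ref)
    dot = sum(tf_doc.get(w, 0) * tf_ref.get(w, 0) for w in vocab)
    norm_doc = math.sqrt(sum(v * v for v in tf_doc.values()))
    norm_ref = math.sqrt(sum(v * v for v in tf_ref.values()))
    if norm_doc == 0 or norm_ref == 0:
        return 0.0  # assumption: an empty profile gives no evidence
    return dot / (norm_doc * norm_ref)


doc = {"berezovsky": 2, "piano": 3, "concert": 1}
pianist_profile = {"berezovsky": 5, "piano": 8, "recital": 2}
businessman_profile = {"berezovsky": 5, "oligarch": 4, "exile": 2}
# The document is closer to the pianist profile than to the businessman one.
print(cosine(doc, pianist_profile) > cosine(doc, businessman_profile))
```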

4.2  Novelty  and  Importance  meta-features 

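We propose to use the Time-Aware Language Model (TALM)
formalized in section 3.2 to catch novelty by using different
known measures:

- The Jensen-Shannon Divergence (JSD) computes the divergence
between two vectors. It uses a third vector $M$ resulting from
averaging the two vectors to compare. Let us define a set of $n$
words $w_i \in d \cup \mathcal{T}^{\Delta}$, considering a TALM
$\mathcal{T}^{\Delta}$ and a document $d$ appearing at time
$t_d$. The JSD can be written as:

\[
M = \frac{1}{2} (d + \mathcal{T}^{\Delta})
\]
\[
JSD = \frac{1}{2} \sum_{i=1}^{n} p^{\Delta}(w_i, t_{w_i}, t_d \mid \mathcal{T}^{\Delta})
      \log \frac{p^{\Delta}(w_i, t_{w_i}, t_d \mid \mathcal{T}^{\Delta})}{p(w_i \mid M)}
    + \frac{1}{2} \sum_{i=1}^{n} p(w_i \mid d) \log \frac{p(w_i \mid d)}{p(w_i \mid M)} \tag{10}
\]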

- The Time-Aware Novelty Score given by Karkali et al. (2014).
They tested different approaches to measure novelty on a
real-world dataset. The novelty score that outperforms the
others is computed using a smoothed version of the well-known
tf.idf weighting scheme with time components. We transcribe the
equation so that we can use it with the TALM (equation 11).
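The Jensen-Shannon Divergence between two word distributions can be sketched in Python as follows. Both inputs are plain `{word: probability}` dictionaries; in our setting one of them would hold the time-aware probabilities of the TALM and the other the probabilities of the incoming document, which is an assumption about how the function would be called.

```python
import math


def jsd(p, q):
    """Jensen-Shannon divergence (natural logarithm) between two distributions."""
    total = 0.0
    for w in set(p) | set(q):
        pw, qw = p.get(w, 0.0), q.get(w, 0.0)
        mw = 0.5 * (pw + qw)          # average distribution M
        if pw > 0:
            total += 0.5 * pw * math.log(pw / mw)
        if qw > 0:
            total += 0.5 * qw * math.log(qw / mw)
    return total


talm_probs = {"merger": 0.6, "railway": 0.4}
doc_probs = {"accident": 0.5, "railway": 0.5}
print(jsd(talm_probs, doc_probs))
```

The score is 0 for identical distributions and reaches its maximum, log 2, when the document shares no word with the TALM, so a high value suggests novel content.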

Burst detection has been used in event detection and
forecasting (Kleinberg, 2002; Sakaki et al., 2010; Weng and Lee,
2011). It has been shown (Amodeo et al., 2011; Peetz et al.,
2014; Wang et al., 2007) that the relevancy of search results
can be improved using timed information such as abnormal peaks
(bursts) of queries in log files, of keywords, or even of
documents related to an entity in a stream. There are diverse
reasons that can explain a burst. Figure 1 shows a burst when an
important event occurs concerning the entity BNSF Railway. The
meta-features we propose to use for importance quantification
are:

- The Kleinberg Burst measure;

- The Elastic Burst measure, which uses wavelet trees to
estimate burst strength.


Figure 1: Showing a burst of documents corresponding to
important news about BNSF Railway. Example document shown in the
figure (entity: BNSF Railway): "... says a person is dead after
being struck by one of its freight trains near Tacoma late last
night. The railroad says the train was traveling from Seattle to
Portland when it hit a person trespassing on BNSF ..."


5  Experimental  Setup 

The entity profiles are built as a pre-processing step using a
dump of Wikipedia from January 2012 in addition to the reference
files provided for each entity. Thus, each entity has:

- an initialized Reference Language Model (RLM). The RLM can be
empty if no reference file has been provided;

- a set of relations (incoming, outgoing, mutual) found using
Wikipedia knowledge graph exploration. In the case where no
Wikipedia page was found for an entity, the set remains empty;

- a set of surface forms found using the heuristics of Cucerzan
(2007) and the pattern recognition introduced in section 2.1;

- an empty Time-Aware Language Model, which is filled while
going through the stream corpus.

Finally, each document $[d_1, \ldots, d_n] \in S$ from the
stream $S$ is processed according to the two filtering steps:

1. If the document $d_n$ contains an occurrence of a surface
form of one or several entities, it is evaluated for each entity
detected in it. Otherwise, the document is not evaluated;

2. For each entity detected in the document $d_n$, the
meta-features are computed and the classification system outputs
the relevancy of the document as well as a confidence score. The
relevancy and the score are stored in the final run submissions.

We define different strategies to compute meta-features. Indeed,
each entity profile is made up of a TALM that has to be updated
with documents. Documents may contain noise that we do not want
to be reflected in the TALM. We compare two different strategies
to update the TALM, together with a baseline without any update:

- Update with Document (UD): the TALM is updated with the full
document;

- Update with Snippet (US): the TALM is updated only with the
paragraphs that contain occurrences of the entity;

- No Update (NU): the TALM (and the meta-features associated
with it) is not used, in order to see whether it brings any
value to our system.
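The three settings can be sketched as a single Python function that selects the text passed to the TALM. This is an illustrative sketch: the function name, the use of blank lines as paragraph separators and the plain substring matching of surface forms are assumptions.

```python
def text_for_talm_update(document, surface_forms, strategy):
    """Return the text used to update the TALM, or None when no update is made."""
    if strategy == "NU":
        return None                      # No Update: the TALM is left untouched
    if strategy == "UD":
        return document                  # Update with Document: full text
    if strategy == "US":
        # Update with Snippet: keep only paragraphs mentioning the entity.
        paragraphs = [p for p in document.split("\n\n") if p.strip()]
        kept = [p for p in paragraphs
                if any(sf.lower() in p.lower() for sf in surface_forms)]
        return "\n\n".join(kept)
    raise ValueError(f"unknown strategy: {strategy}")


doc = "BNSF Railway reported a delay.\n\nUnrelated weather news."
print(text_for_talm_update(doc, ["BNSF Railway"], "US"))
```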

For the classification system, we use a Random Forest classifier
with 50 trees. We designed four different classification
strategies:

- the first strategy, 2STEPS, treats the problem as two binary
classification problems handled by two classifiers in cascade.
The first one, $C_{GN/UV}$, classifies among two classes:
Garbage/Neutral and Useful/Vital. For documents classified as
Useful/Vital, a second classifier, $C_{U/V}$, determines the
final output class between Useful and Vital;

- the second strategy, SINGLE, directly performs a
classification among the four classes;

- the third strategy, VvsAll, trains a classifier on all
documents considering only two classes: vital and others (all
classes but vital). When this classifier gives a non-vital
class, the SINGLE method is used to determine another class
among Garbage, Neutral and Useful;

- the last strategy, MULTI, uses the scores emitted by all
previous classifiers and learns the best output class
considering every classifier's scores for every class.
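The cascade used by the 2STEPS strategy can be sketched as follows. The two classifiers are passed as plain Python callables standing in for trained Random Forest models; the callable names, the label strings and the toy threshold rules are hypothetical and only illustrate the control flow.

```python
def two_steps(features, c_gn_uv, c_u_v):
    """Cascade of two binary classifiers (hypothetical callables).

    c_gn_uv returns "GN" or "UV"; c_u_v returns "useful" or "vital".
    """
    if c_gn_uv(features) == "GN":
        return "garbage/neutral"
    # The second classifier is only consulted for Useful/Vital documents.
    return c_u_v(features)


# Toy stand-ins for trained models, based on two made-up meta-features.
def toy_gn_uv(f):
    return "UV" if f["cosine_rlm"] >= 0.3 else "GN"


def toy_u_v(f):
    return "vital" if f["jsd_talm"] >= 0.5 else "useful"


print(two_steps({"cosine_rlm": 0.6, "jsd_talm": 0.7}, toy_gn_uv, toy_u_v))
```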

We submitted 9 runs, where each run explores a combination of an
update strategy with a classification strategy.

6  Results  analysis 

We propose a two-step filtering system where the first step aims
to keep only documents having an occurrence of a surface form of
at least one entity. The performance of the first step is
reported in Table 1. Those results show that we obtain
satisfactory performance, since 87% of the documents concerning
the entities are found with an error rate of only about 6%.


Found and in truth data:       87.10%
Found and not in truth data:    5.90%
Not found:                     12.9%

Table 1: First filtering step results


Systems    NU     UD     US

F-measure Vital
MULTI      .321   .326   .316
SINGLE     .252   .261   .290
2STEPS     .248   .304   .292
VvsAll     .217   .224   .297

F-measure Vital+Useful
MULTI      .777   .783   .783
SINGLE     .764   .779   .784
2STEPS     .759   .771   .782
VvsAll     .690   .692   .720

Table 2: Scores obtained by our systems MULTI, SINGLE, 2STEPS
and VvsAll with the different settings NU (no update), UD
(update with document) and US (update with snippet) for vital
and vital+useful classification

The second filtering step consists in giving a class to every
document kept from the first step. For the official submission,
we designed 9 different strategies and we obtain the results
summarized in Table 2. The official measure is the F-measure
(harmonic mean of precision and recall). We clearly see that
getting satisfactory results for vital documents is really
difficult. However, we obtain an F-measure of almost .80 when
filtering useful and vital documents. This means that our system
is able to detect that a document is centered on an entity at a
rate of about 80%.
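For reference, the harmonic mean of precision and recall can be computed as in the short Python sketch below. The precision and recall values in the usage line are made-up illustrations, not results from our runs.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative values only: a large gap between precision and recall
# pulls the F-measure towards the lower of the two.
print(f_measure(0.9, 0.3))
```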

After the submission deadline, we found a bug in our first
filtering step. Some patterns were not working properly, so some
documents were missed. After running the system again with the
fix, the performance of the first step increased (Table 3). We
obtained similar F-measure results for step 2.

Found and in truth data:       98.95%   (+11.85%)
Found and not in truth data:    8.89%   (+2.99%)
Not found:                      1.05%   (-11.85%)

Table 3: First filtering step results with the bug fixed on
surface form patterns

For many entities we have only a little information. We wanted
to measure whether performance could be increased with some more
knowledge for all entities. We set a limit of 5 reference
documents for each entity. Some entities already have reference
documents given by the KBA organizers. We add up to 5 useful
documents from the training data to each entity, using the first
5 documents seen for each entity. By doing so, we enrich the
profile with more knowledge while still having a scalable
system. We obtained the results summarized in Table 4. As we can
see, performance increased markedly for both useful and vital
filtering.

Figure 2: Showing a burst of documents corresponding to
important news about BNSF Railway.


Systems    NU     UD     US

F-measure Vital
MULTI      .387   .381   .364
SINGLE     .346   .337   .307
2STEPS     .351   .301   .315
VvsAll     .339   .327   .301

F-measure Vital+Useful
MULTI      .894   .895   .891
SINGLE     .902   .892   .893
2STEPS     .890   .889   .894
VvsAll     .895   .895   .891

Table 4: Scores obtained by our systems MULTI, SINGLE, 2STEPS
and VvsAll with the different settings NU (no update), UD
(update with document) and US (update with snippet) for vital
and vital+useful classification

In order to observe the impact of each feature on the
classification, we look at the Variable Importance (VI) given by
Breiman (2001). The VI indicates how significant a feature is in
the classification decision by randomly changing the values
associated with each feature (one at a time) and observing the
out-of-bag error. Figure 2 shows that the features derived from
the Reference Language Model (RLM) and the Time-Aware Language
Model (TALM) are among the top 5 important features. In
addition, relations discovered on the Wikipedia page of the
entity (OUT_RELATIONS) are also very decisive. Finding known
relations within a document helps in discovering Vital
information. On the negative side, we noticed that burst
detection does not really help in finding Vital information,
which is counterintuitive.
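The idea behind the Variable Importance can be sketched in Python as a permutation test: shuffle one feature column at a time and measure how much the error grows. This is a simplified illustration that uses a fixed evaluation set instead of the out-of-bag samples of a Random Forest; the function name, the `predict` callable and the toy data are assumptions.

```python
import random


def permutation_importance(predict, X, y, n_repeats=5, seed=0):
    """Mean increase in error rate when each feature column is shuffled."""
    rng = random.Random(seed)

    def error(rows):
        return sum(predict(r) != t for r, t in zip(rows, y)) / len(y)

    base = error(X)
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Rebuild the rows with only column j permuted.
            permuted = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            increases.append(error(permuted) - base)
        importances.append(sum(increases) / n_repeats)
    return importances


# Toy data: the label depends only on the first feature.
X = [[i % 2, i % 3] for i in range(20)]
y = [row[0] for row in X]
print(permutation_importance(lambda row: row[0], X, y))
```

A feature whose permutation leaves the error unchanged, like the second column here, carries no information for the classifier.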

7  Conclusion  and  Perspectives 

To conclude, we present a filtering system based on two
filtering steps. We demonstrate that the first step obtains very
good results in document pre-selection. We obtain satisfactory
results with the current KBA framework. We also show that the
results can be markedly improved with more knowledge about an
entity while still having a scalable system. Finally, we found
that the meta-features linked to the Reference Language Model
and the Time-Aware Language Model are really useful in vital
document classification.

We noticed that burst detection is not always a reliable clue,
depending on the entity. In the future, we will investigate
whether some features suit some entities better than others, in
order to automatically choose the appropriate system.